Among all cancer types, lung cancer is the most lethal among both males and females. It is responsible for 28% of cancer-related mortality. The life expectation for those who have lung cancer are very poor. It is usually diagnosed in an advanced phase, and the patients only have a 16% survival rate after 5 years of their diagnosis.
The Cancer Genome Atlas (TCGA) has comprehensively profiled this type of cancer in a patient cohort. Here, we analyze the expression profiles of those patients, using a pipeline based on the R/Bioconductor software package Rsubread.
This document is written in R markdown and
should be processed using R and you need to install the packages
knitr and markdown. Moreover, it using the official style for Bioconductor vignettes facilitated by the Bioconductor package BiocStyle. It is compiled from different files following the instructions included in the makefile.
Lung cancer, specifically lung squamous cell carcinoma, was assessed and analyzed in this experiment. We start importing the raw table of counts.
library(SummarizedExperiment)
se.unpaired <- readRDS(file.path("rawCounts", "seLUSC.rds"))
se.unpaired
class: RangedSummarizedExperiment
dim: 20115 553
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(553): TCGA.18.3406.01A.01R.0980.07
TCGA.18.3407.01A.01R.0980.07 ... TCGA.90.7767.11A.01R.2125.07
TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
table(se.unpaired$type)
normal tumor
51 502
From this, we see that 20115 genes from a total of 553 samples are in this data. We can also observe that this dataset contains 502 tumor samples and 51 normal samples, which matches the information of the dataset from The Cancer Genome Atlas (TCGA) program.
Next, we explored the column (phenotypic) data, which in this case corresponds to clinical variables, and their corresponding metadata. For example, the column “type” consists of tumor and normal labels. Other information such as gender, tumor stage, and smoking status are also shown.
In any case, we can access all the possible labels with the last line of code, which returns two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be use in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search.
dim(colData(se.unpaired))
[1] 553 549
colData(se.unpaired)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
type bcr_patient_uuid
<factor> <factor>
TCGA.18.3406.01A.01R.0980.07 tumor 95b83006-02c9-4c4d-bf84-a45115f7d86d
TCGA.18.3407.01A.01R.0980.07 tumor 4e1ad82e-23c8-44bb-b74e-a3d0b1126b96
TCGA.18.3408.01A.01R.0980.07 tumor d4bc755a-2585-4529-ae36-7e1d88bdecfe
TCGA.18.3409.01A.01R.0980.07 tumor b09e872a-e837-49ec-8a27-84dcdcabf347
TCGA.18.3410.01A.01R.0980.07 tumor 99599b60-4f5c-456b-8755-371b1aa7074e
bcr_patient_barcode form_completion_date
<factor> <factor>
TCGA.18.3406.01A.01R.0980.07 TCGA-18-3406 2011-3-9
TCGA.18.3407.01A.01R.0980.07 TCGA-18-3407 2011-3-9
TCGA.18.3408.01A.01R.0980.07 TCGA-18-3408 2011-3-9
TCGA.18.3409.01A.01R.0980.07 TCGA-18-3409 2011-3-17
TCGA.18.3410.01A.01R.0980.07 TCGA-18-3410 2011-4-4
prospective_collection
<factor>
TCGA.18.3406.01A.01R.0980.07 NO
TCGA.18.3407.01A.01R.0980.07 NO
TCGA.18.3408.01A.01R.0980.07 NO
TCGA.18.3409.01A.01R.0980.07 NO
TCGA.18.3410.01A.01R.0980.07 NO
mcols(colData(se.unpaired), use.names=TRUE)
DataFrame with 549 rows and 2 columns
labelDescription
<character>
type sample type (tumor/normal)
bcr_patient_uuid bcr patient uuid
bcr_patient_barcode bcr patient barcode
form_completion_date form completion date
prospective_collection tissue prospective collection indicator
... ...
lymph_nodes_pelvic_pos_total total pelv lnp
lymph_nodes_aortic_examined_count total aor lnr
lymph_nodes_aortic_pos_by_he aln pos light micro
lymph_nodes_aortic_pos_by_ihc aln pos ihc
lymph_nodes_aortic_pos_total total aor-lnp
CDEID
<character>
type NA
bcr_patient_uuid NA
bcr_patient_barcode 2673794
form_completion_date NA
prospective_collection 3088492
... ...
lymph_nodes_pelvic_pos_total 3151828
lymph_nodes_aortic_examined_count 3104460
lymph_nodes_aortic_pos_by_he 3151832
lymph_nodes_aortic_pos_by_ihc 3151831
lymph_nodes_aortic_pos_total 3151827
Now, we explore the row (feature) data, which provide information on genes. For the first line of command, we can see the length of each gene and the GC content. The second line of command provides further information, such as the ranges within individual chromosome.
rowData(se.unpaired)
DataFrame with 20115 rows and 3 columns
symbol txlen txgc
<character> <integer> <numeric>
1 A1BG 3322 0.564419024683925
2 A2M 4844 0.488232865400495
9 NAT1 2280 0.394298245614035
10 NAT2 1322 0.389561270801815
12 SERPINA3 3067 0.524942940984676
... ... ... ...
100996331 POTEB 1706 0.430832356389215
101340251 SNORD124 104 0.490384615384615
101340252 SNORD121B 81 0.407407407407407
102724473 GAGE10 538 0.505576208178439
103091865 BRWD1-IT2 1028 0.592412451361868
rowRanges(se.unpaired)
GRanges object with 20115 ranges and 3 metadata columns:
seqnames ranges strand | symbol txlen
<Rle> <IRanges> <Rle> | <character> <integer>
1 chr19 58345178-58362751 - | A1BG 3322
2 chr12 9067664-9116229 - | A2M 4844
9 chr8 18170477-18223689 + | NAT1 2280
10 chr8 18391245-18401218 + | NAT2 1322
12 chr14 94592058-94624646 + | SERPINA3 3067
... ... ... ... . ... ...
100996331 chr15 20835372-21877298 - | POTEB 1706
101340251 chr17 40027542-40027645 - | SNORD124 104
101340252 chr9 33934296-33934376 - | SNORD121B 81
102724473 chrX 49303669-49319844 + | GAGE10 538
103091865 chr21 39313935-39314962 + | BRWD1-IT2 1028
txgc
<numeric>
1 0.564419024683925
2 0.488232865400495
9 0.394298245614035
10 0.389561270801815
12 0.524942940984676
... ...
100996331 0.430832356389215
101340251 0.490384615384615
101340252 0.407407407407407
102724473 0.505576208178439
103091865 0.592412451361868
-------
seqinfo: 455 sequences (1 circular) from hg38 genome
In this analysis, we wanted to compare tumor and normal cells. In order to do this, we wanted to compare tumor and normal cells that come from the same patient. This also allows to pair the samples which reduces the possibility of false positives. In order to get them, we execute the following code.
# get the number of occurences of each patient
occur <- data.frame(table(substr(colnames(se.unpaired), 9, 12)))
# get those that occur twice
paired_df <- occur[occur$Freq > 1,]
paired <- as.vector(paired_df$Var1)
paired <- paired[2:length(paired)]
mask <- is.element(substr(colnames(se.unpaired), 9, 12), paired)
se <- se.unpaired[ ,mask]
saveRDS(se, file.path("results", "se.paired.rds"))
Since this data is unprocessed, we must first perform a quality assessment and normalize accordingly. To do this, we need first to load the edgeR R/Bioconductor package. This requires the creation of a `DGEList’ object, as the package doesn’t work directly with SummarizedExperiment objects. In any case, we will update any change in the DGElist object to the SummarizedExperiment object.
library(edgeR)
dge <- DGEList(counts=assays(se)$counts, genes=mcols(se))
saveRDS(dge, file.path("results", "dge.paired.rds"))
One way of normalizing is to use the counts per million (CPM) values. This value is calculated by dividing the number of counts of each sample by 1 million. Instead of using this directly, we calculate \(\log_2\) CPM values of expression because it has nicer distributional properties than raw counts or non-logged CPM units. We set this information as an additional assay element to ease their manipulation.
assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(se)$logCPM[1:5, 1:5]
TCGA.22.4593.01A.21R.1820.07 TCGA.22.4609.01A.21R.2125.07
1 2.433139 1.760377
2 9.013463 10.810625
9 -6.817220 -6.817220
10 -6.817220 -6.817220
12 5.605909 5.647356
TCGA.22.5471.01A.01R.1635.07 TCGA.22.5472.01A.01R.1635.07
1 2.122666 0.775287
2 7.713750 11.100735
9 -6.817220 -6.817220
10 -6.817220 -6.817220
12 4.823947 5.856939
TCGA.22.5478.01A.01R.1635.07
1 3.713844
2 9.448174
9 -6.817220
10 -6.817220
12 4.920820
Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order.
Figure 1: Library sizes in increasing order
This figure reveals no big changes in sequencing depth between samples. In addition, we see a uniform distribution of tumor and normal samples across the figure. We see a cluster of normal samples on the right side with a higher cpm, but don’t believe this will affect our analyses. We can obtain the samples with less sequencing depth using the commands below.
sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(se), 6, 12)
sort(sampledepth)
43.6773 56.7823 77.7335 56.8309 56.8083 58.8386 77.7335 43.6773 56.7582
28.8 29.0 31.0 32.0 32.9 33.4 34.9 35.0 35.9
56.8201 56.8201 56.8623 34.8454 77.7142 56.7579 56.7730 56.8083 56.8082
36.5 37.0 37.0 37.6 38.7 39.7 40.2 40.3 40.4
56.8082 51.4081 34.8454 22.5481 56.8309 77.7337 33.4587 56.7823 56.7580
40.7 41.9 42.1 42.3 43.6 44.0 45.6 45.7 45.8
56.7579 85.7710 58.8386 34.7107 56.7731 22.5471 77.7142 56.7580 56.7731
45.9 46.5 46.6 47.3 47.6 47.8 48.4 49.7 50.0
85.7710 43.3394 39.5040 22.5472 60.2709 51.4079 90.6837 51.4080 77.7338
50.1 50.2 51.0 51.1 51.2 51.5 51.6 51.9 52.0
56.7222 77.7338 77.7138 56.7582 22.5471 43.5670 22.4609 22.5483 92.7340
52.4 52.4 52.8 52.9 53.6 53.6 54.2 54.4 54.6
77.8008 34.7107 90.7767 43.7657 22.4609 43.7657 77.7138 33.6737 43.7658
54.9 55.1 55.7 55.8 56.3 56.7 57.0 57.2 57.4
77.7337 77.8007 43.5670 56.7222 56.8623 33.4587 43.7658 43.6143 22.5481
57.6 57.6 58.0 58.0 59.1 59.6 59.9 60.0 60.3
22.5491 43.6647 92.7340 39.5040 22.5489 22.5482 77.8008 22.5489 22.5491
61.0 61.3 61.3 61.5 63.5 64.5 65.2 65.5 66.0
77.8007 22.5478 90.6837 90.7767 56.7730 33.6737 60.2709 22.5478 43.6143
66.4 68.0 68.2 68.4 68.8 69.3 69.8 71.0 75.9
22.5472 22.5482 22.4593 43.6771 22.5483 43.6647 22.4593 43.6771 51.4081
76.2 77.0 77.9 79.4 80.3 81.5 85.0 88.8 103.7
51.4080 51.4079 43.3394
119.5 122.6 124.1
Next, we will look at the distribution of expression values per sample in terms of logarithmic CPM units. We display tumor and normal samples separately, and are shown in Figure 2. Each colored line in the graphs below represent a sample. In both graphs, we see two modes. The low mode represents genes not expressed in that sample. The second mode represents the genes that are expressed.
Figure 2: Non-parametric density distribution of expression profiles per sample
In both graphs, we see some genes that deviate a little from the rest. These could be potential outliers and we have noted them for later analysis.
Next, we wanted to look at the distribution of expression levels across genes. To do this, we calculate the average expression per gene through all the samples. Figure 3 shows the distribution of those values.
Figure 3: Distribution of average expression level per gene
RNA sequence expression profiles that come from lowly-expressed genes can lead to artifacts in downstream differential expression analyses. Thus, it is common to set a threshold and filter out genes that fall below this value. In the light of this plot above, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes. Now, we have a total of 11872 genes and 102 samples.
mask <- avgexp > 1
dim(se)
[1] 20115 102
se.filt <- se[mask, ]
dim(se.filt)
[1] 11872 102
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 11872 102
We then store the un-normalized versions of the filtered expression data.
saveRDS(se.filt, file.path("results", "se.paired.filt.unnorm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.filt.unnorm.rds"))
Next, we calculate the normalization factors on the filtered expression data set using the calcNormFactors function.
dge.filt <- calcNormFactors(dge.filt)
Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.
assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)
Store normalized versions of the filtered expression data.
saveRDS(se.filt, file.path("results", "se.paired.filt.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.filt.rds"))
We examine now the MA-plots of the normalized expression profiles. These plots allow us to visualize how one sample compares to the average of the rest of the samples. We look first to the tumor samples in Figure 4.
Figure 4: MA-plots of the tumor samples
In the MA-plots of the tumor samples, we observed the following:
We observe that the appearance of samples that differ from the mean can be problematic. If during downstream analysis we observe unexpected results, those samples should be removed together with their normal pairs. We can obtain their names with the following code:
big_c
[1] "TCGA.56.7823" "TCGA.56.8623" "TCGA.77.7138"
Next, we look now to the normal samples in Figure 5.
Figure 5: MA-plots of the normal samples
In this case, we observed:
In conclusion, we do not observe either important expression-level dependent biases among the normal samples.
Batch effects are sub-groups of measurements that have a qualitatively different behavior across conditions and are unrelated to the variables in the study. It is important to determine if batch effects are present to know if any confounding variables are affecting the data. We will search now for potential surrogate of batch effect indicators within normal and tumor samples. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.
tss <- substr(colnames(se.filt), 6, 7)
table(data.frame(TYPE=se.filt$type, TSS=tss))
TSS
TYPE 22 33 34 39 43 51 56 58 60 77 85 90 92
normal 10 2 2 1 8 3 12 1 1 7 1 2 1
tumor 10 2 2 1 8 3 12 1 1 7 1 2 1
center <- substr(colnames(se.filt), 27, 28)
table(data.frame(TYPE=se.filt$type, CENTER=center))
CENTER
TYPE 07
normal 51
tumor 51
plate <- substr(colnames(se.filt), 22, 25)
table(data.frame(TYPE=se.filt$type, PLATE=plate))
PLATE
TYPE 0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2247 2296 2326
normal 0 0 5 4 7 1 4 10 10 2 4 2 1
tumor 1 3 6 0 7 0 4 10 10 2 4 2 1
PLATE
TYPE A28V
normal 1
tumor 1
portionanalyte <- substr(colnames(se.filt), 18, 20)
table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
PORTIONANALYTE
TYPE 01R 11R 21R 31R 41R
normal 48 3 0 0 0
tumor 11 28 8 2 2
samplevial <- substr(colnames(se.filt), 14, 16)
table(data.frame(TYPE=se.filt$type, SAMPLEVIAL=samplevial))
SAMPLEVIAL
TYPE 01A 01B 11A
normal 0 0 51
tumor 50 1 0
From this information we can make the following observations:
colnames(se.filt)[samplevial == "01B"]
[1] "TCGA.56.7823.01B.11R.2247.07"
With those results, we use the portion of analyte as surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with portion of analyte.
table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
PORTIONANALYTE
TYPE 01R 11R 21R 31R 41R
normal 48 3 0 0 0
tumor 11 28 8 2 2
As we can see here, the majority of normal samples (48) come from the “01R”" analyte and 3 come from the “11R” analyte. However, the tumor samples span across five different analytes. This could potentially lead to a batch effect.
In order to examine how the samples group together, we use two approaches: hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the surrogate of batch indicator, or portion of analyte in our case.
We perform this under a subset of our samples, as we wish to have a matrix without zeros. We remove samples belonging to portions 21R, 31R and 41R; and as the experiment is paired, also their pairs.
por1 <- as.vector(substr(colnames(se.filt)[portionanalyte == "41R"], 9, 12))
por2 <- as.vector(substr(colnames(se.filt)[portionanalyte == "31R"], 9, 12))
por3 <- as.vector(substr(colnames(se.filt)[portionanalyte == "21R"], 9, 12))
allpor <- c(por1, por2, por3)
table(is.element(substr(colnames(se.filt), 9, 12), allpor))
FALSE TRUE
78 24
se.batch <- se.filt[ ,!is.element(substr(colnames(se.filt), 9, 12), allpor)]
dge.batch <- DGEList(counts=assays(se.batch)$counts, genes=mcols(se.batch))
new.portionanalyte <- substr(colnames(se.batch), 18, 20)
table(data.frame(TYPE=se.batch$type, PORTIONANALYTE=new.portionanalyte))
PORTIONANALYTE
TYPE 01R 11R
normal 36 3
tumor 11 28
dim(dge.batch)
[1] 11872 78
We see that we are able to remove 24 samples, which correspond to the 12 in the portions we want to remove and their pairs.
We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 6.
logCPM <- cpm(dge.batch, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(new.portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.batch)
outcome <- paste(substr(colnames(se.batch), 9, 12), as.character(se.batch$type), sep="-")
names(outcome) <- colnames(se.batch)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Figure 6: Hierarchical clustering of the samples
We can observe that samples cluster primarily by sample type, tumor or normal. We observe that the different portions are present in both clusters, so we don’t consider that it is inducing batch effect.
We also see that one of the tumor samples, 8623-tumor, is present in the normal samples cluster. This one also had a strong deviation in the MA plot, so we might consider discarding it.
In Figure 7 we show the corresponding MDS plot. Here we see more clearly that the first source of variation separates tumor from normal samples, and we don’t see any sample that is separated from any cluster.
plotMDS(dge.batch, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))),
fill=sort(unique(batch)), inset=0.05)
Figure 7: Multidimensional scaling plot of the samples
Finally, under the consideration that there is no batch effect, the 24 samples that were removed in order to make the dendogram and the MDS plot will remain for further analysis. We decide to generate the dendogram in Figure 8 again in order to see if any other sample appears in its opposite cluster, apart from 8623-tumor.
logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Figure 8: Hierarchical clustering of the samples
We observe that only the 8623-tumor sample doesn’t cluster with the rest of the tumor samples.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] geneplotter_1.60.0 annotate_1.60.1
[3] XML_3.98-1.19 AnnotationDbi_1.44.0
[5] lattice_0.20-38 edgeR_3.24.3
[7] limma_3.38.3 SummarizedExperiment_1.12.0
[9] DelayedArray_0.8.0 BiocParallel_1.16.6
[11] matrixStats_0.54.0 Biobase_2.42.0
[13] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[15] IRanges_2.16.0 S4Vectors_0.20.1
[17] BiocGenerics_0.28.0 knitr_1.22
[19] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 RColorBrewer_1.1-2 compiler_3.5.3
[4] BiocManager_1.30.4 highr_0.8 XVector_0.22.0
[7] bitops_1.0-6 tools_3.5.3 zlibbioc_1.28.0
[10] bit_1.1-14 digest_0.6.18 memoise_1.1.0
[13] RSQLite_2.1.1 evaluate_0.13 Matrix_1.2-17
[16] DBI_1.0.0 yaml_2.2.0 xfun_0.6
[19] GenomeInfoDbData_1.2.0 stringr_1.4.0 bit64_0.9-7
[22] locfit_1.5-9.1 grid_3.5.3 rmarkdown_1.12
[25] bookdown_0.9 blob_1.1.1 magrittr_1.5
[28] codetools_0.2-16 htmltools_0.3.6 xtable_1.8-4
[31] KernSmooth_2.23-15 stringi_1.4.3 RCurl_1.95-4.12
We perform a simple examination of expression changes and their associated p-values using the R/Bioconductor package sva.
In order to use the method, a matrix of the model, including adjustment variables, and a null model which only includes the adjustment variables have to be created. In our case there are no factors to adjust, so we give ~1 as the intercept term in the model.
library(sva)
mod <- model.matrix(~type + 1, data = colData(se.filt))
mod0 <- model.matrix(~1, data = colData(se.filt))
sv <- sva(assays(se.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is: 22
Iteration (out of 5 ):1 2 3 4 5
pv <- f.pvalue(assays(se.filt)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.01)
[1] 8477
There are 8477 genes changing significantly their expression at FDR < 1%. In Figure 9 below we show the distribution of the resulting p-values. This distribution should be uniform with a peak at low p-values for the truly DE genes, assuming that most genes are not differentially expressed.
Figure 9: Distribution of raw p-values for an F-test on every gene between tumor and normal samples
The sva() function is able to estimate surrogate variables, which have continuous values that can be used to correct and adjust unmeasured and unwanted effects.
sv <- sva(assays(se.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is: 21
Iteration (out of 5 ):1 2 3 4 5
sv$n
[1] 21
The SVA algorithm has found 21 surrogate variables. Let’s use them to assess againt the extent of differential expression this time adjusting for these surrogate variables.
modsv <- cbind(mod, sv$sv)
mod0sv <- cbind(mod0, sv$sv)
pvsv <- f.pvalue(assays(se.filt)$logCPM, modsv, mod0sv)
sum(p.adjust(pvsv, method="fdr") < 0.01)
[1] 9457
We have increased the number of changing genes to 9457. Figure 10 shows the resulting distribution of p-values.
Figure 10: Distribution of raw p-values for an F-test on every gene between tumor and normal samples, adjusting for surrogate variables estimated with SVA
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] sva_3.30.1 genefilter_1.64.0
[3] mgcv_1.8-28 nlme_3.1-139
[5] geneplotter_1.60.0 annotate_1.60.1
[7] XML_3.98-1.19 AnnotationDbi_1.44.0
[9] lattice_0.20-38 edgeR_3.24.3
[11] limma_3.38.3 SummarizedExperiment_1.12.0
[13] DelayedArray_0.8.0 BiocParallel_1.16.6
[15] matrixStats_0.54.0 Biobase_2.42.0
[17] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[19] IRanges_2.16.0 S4Vectors_0.20.1
[21] BiocGenerics_0.28.0 knitr_1.22
[23] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 xfun_0.6 splines_3.5.3
[4] htmltools_0.3.6 yaml_2.2.0 blob_1.1.1
[7] survival_2.44-1.1 DBI_1.0.0 bit64_0.9-7
[10] RColorBrewer_1.1-2 GenomeInfoDbData_1.2.0 stringr_1.4.0
[13] zlibbioc_1.28.0 codetools_0.2-16 evaluate_0.13
[16] memoise_1.1.0 highr_0.8 Rcpp_1.0.1
[19] xtable_1.8-4 BiocManager_1.30.4 XVector_0.22.0
[22] bit_1.1-14 digest_0.6.18 stringi_1.4.3
[25] bookdown_0.9 grid_3.5.3 tools_3.5.3
[28] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.12
[31] RSQLite_2.1.1 Matrix_1.2-17 rmarkdown_1.12
[34] compiler_3.5.3
Here we will perform the functional analysis in the second part of the project.
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] sva_3.30.1 genefilter_1.64.0
[3] mgcv_1.8-28 nlme_3.1-139
[5] geneplotter_1.60.0 annotate_1.60.1
[7] XML_3.98-1.19 AnnotationDbi_1.44.0
[9] lattice_0.20-38 edgeR_3.24.3
[11] limma_3.38.3 SummarizedExperiment_1.12.0
[13] DelayedArray_0.8.0 BiocParallel_1.16.6
[15] matrixStats_0.54.0 Biobase_2.42.0
[17] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[19] IRanges_2.16.0 S4Vectors_0.20.1
[21] BiocGenerics_0.28.0 knitr_1.22
[23] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 RColorBrewer_1.1-2 compiler_3.5.3
[4] BiocManager_1.30.4 XVector_0.22.0 bitops_1.0-6
[7] tools_3.5.3 zlibbioc_1.28.0 bit_1.1-14
[10] digest_0.6.18 memoise_1.1.0 evaluate_0.13
[13] RSQLite_2.1.1 Matrix_1.2-17 DBI_1.0.0
[16] yaml_2.2.0 xfun_0.6 GenomeInfoDbData_1.2.0
[19] stringr_1.4.0 bit64_0.9-7 locfit_1.5-9.1
[22] grid_3.5.3 survival_2.44-1.1 rmarkdown_1.12
[25] bookdown_0.9 blob_1.1.1 magrittr_1.5
[28] splines_3.5.3 htmltools_0.3.6 xtable_1.8-4
[31] stringi_1.4.3 RCurl_1.95-4.12